The least squares and maximum likelihood paradigms differ in how the objective function is formulated. In the least squares paradigm, the objective function is simply the
sum of squared residuals, and the boundary restriction on the dependent variable can be treated as a set of linear
constraints. Parameter estimation can then be specified as a quadratic programming (QP) problem
(Vanderbei, 2008):
\begin{align*}
\text{Minimize} \qquad & f\left( \beta \right)=\sum\limits_{i=1}^{n}{\sum\limits_{t=1}^{{{T}_{i}}}{{{\left[ \left(
{{y}_{it}}-{{{\bar{y}}}_{i}} \right)-\left( {{x}_{it}}-{{{\bar{x}}}_{i}} \right)\beta \right]}^{2}}}} \\
\text{Subject to} \qquad
&\qquad \left( {{x}_{it}}-{{{\bar{x}}}_{i}} \right)\beta \le b \\
&\quad -\left( {{x}_{it}}-{{{\bar{x}}}_{i}} \right)\beta \le -a.
\end{align*}
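As a concrete illustration, the QP above can be solved numerically. The sketch below is not the author's procedure: it simulates a small panel (all names, bounds, and parameter values are illustrative assumptions) and minimizes the within-groups sum of squares under the linear constraints using `scipy.optimize.minimize` with the SLSQP method.

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative panel data (all values here are assumptions for the sketch).
rng = np.random.default_rng(0)
n, T = 5, 20                      # districts, periods
a, b = -4.0, 4.0                  # bounds on the fitted demeaned values
beta_true = 0.5
x = rng.normal(size=(n, T))
alpha = rng.normal(size=(n, 1))   # district fixed effects
y = alpha + beta_true * x + 0.1 * rng.normal(size=(n, T))

# Within-groups (demeaned) transformation.
xd = (x - x.mean(axis=1, keepdims=True)).ravel()
yd = (y - y.mean(axis=1, keepdims=True)).ravel()

def ssr(theta):
    """Objective f(beta): sum of squared within-groups residuals."""
    return float(np.sum((yd - xd * theta[0]) ** 2))

# Linear constraints a <= (x_it - xbar_i) * beta <= b for every observation.
constraints = [
    {"type": "ineq", "fun": lambda theta: b - xd * theta[0]},
    {"type": "ineq", "fun": lambda theta: xd * theta[0] - a},
]
result = minimize(ssr, x0=[0.0], method="SLSQP", constraints=constraints)
beta_hat = result.x[0]
```

When the bounds are wide enough that no constraint binds, as in this setup, the constrained solution coincides with the ordinary within-groups least squares estimate.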
We can modify the objective function slightly by imposing a distributional assumption
on the demeaned dependent variable $\left( {{y}_{it}}-{{{\bar{y}}}_{i}} \right)$ in
(2.1):
\begin{equation}
\left( {{y}_{it}}-{{{\bar{y}}}_{i}} \right)\sim TN\left[ \left( {{x}_{it}}-{{{\bar{x}}}_{i}} \right)\beta ,
{{\sigma }^{2}};p_{1},q_{1} \right],
\tag{2.2}
\end{equation}
Given this distributional assumption, the objective function can be specified as
\begin{equation*}
\text{Maximize} \qquad \log L\equiv -\sum\limits_{i=1}^{n}{\sum\limits_{t=1}^{{{T}_{i}}}{\left\{ {{D}_{it}}-\frac{1}
{2{{\sigma }^{2}}}{{\left[ \left( {{y}_{it}}-{{{\bar{y}}}_{i}} \right)-\left( {{x}_{it}}-{{{\bar{x}}}_{i}}
\right)\beta \right]}^{2}} \right\}}}.
\end{equation*}
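Maximizing a truncated-normal log-likelihood of this kind can be sketched numerically. The snippet below simulates demeaned data from a truncated normal and maximizes the likelihood over $(\beta, \sigma)$; the truncation points, sample size, and true parameter values are illustrative assumptions. Rather than coding the $D_{it}$ terms by hand, it relies on `scipy.stats.truncnorm.logpdf`, which carries the same normalizing constants.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import truncnorm

# Illustrative setup (all values are assumptions for the sketch).
rng = np.random.default_rng(1)
n_obs = 400
p1, q1 = -2.0, 2.0                # truncation points of the demeaned variable
beta_true, sigma_true = 0.5, 0.8
xd = rng.normal(size=n_obs)       # demeaned covariate (x_it - xbar_i)

# Draw demeaned y from TN(xd * beta, sigma^2; p1, q1).
mu = xd * beta_true
yd = truncnorm.rvs((p1 - mu) / sigma_true, (q1 - mu) / sigma_true,
                   loc=mu, scale=sigma_true, random_state=rng)

def negloglik(theta):
    """Negative truncated-normal log-likelihood in (beta, log sigma)."""
    beta, log_sigma = theta
    sigma = np.exp(log_sigma)
    m = xd * beta
    return -truncnorm.logpdf(yd, (p1 - m) / sigma, (q1 - m) / sigma,
                             loc=m, scale=sigma).sum()

result = minimize(negloglik, x0=[0.0, 0.0], method="Nelder-Mead")
beta_hat, sigma_hat = result.x[0], np.exp(result.x[1])
```

Optimizing over $\log\sigma$ rather than $\sigma$ keeps the scale parameter positive without an explicit constraint.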
From the perspective of the likelihood paradigm, both objective functions above are
problematic, since the panel regression is incorrectly specified in the first place.\textsuperscript{9}
To see why this is so, we first assume that the dependent variable $y_{it}$ is distributed as truncated normal
\begin{equation*}
{{y}_{it}}\sim TN\left( {{\mu }_{i}},{{\sigma }^{2}};a,b \right),
\end{equation*}
If we want to specify a model conceptually equivalent to the panel regression, we can use the
individual-level dependent variable to estimate the district-level location parameters $\mu_{i}$, and then perform
the demeaning operation to derive the within-groups regression.\textsuperscript{11} In this scenario,
the dependent variable can be specified as
\begin{equation*}
\left( {{y}_{it}}-{\hat{\mu }_{i}} \right)\sim TN\left( {{x}^{*}_{it}}\beta ,{{\sigma }^{2}};p_{2},q_{2} \right),
\end{equation*}
Applying maximum likelihood estimation, we can derive the objective function as
\begin{equation*}
\text{Maximize} \qquad \log L\equiv -\sum\limits_{i=1}^{n}{\sum\limits_{t=1}^{{{T}_{i}}}{\left\{ {{D}_{it}}-\frac{1}
{2{{\sigma }^{2}}}{{\left[ \left( {{y}_{it}}-{{{\hat{\mu }}}_{i}} \right)-{{x}_{it}}\beta \right]}^{2}} \right\}}}.
\end{equation*}
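The two-stage procedure behind this specification (see footnote 11) can be sketched as follows. For simplicity, the first stage estimates each $\hat{\mu}_{i}$ by the district mean of $y_{it}$ rather than a covariate-free truncated-normal MLE, and the data are simulated without truncation; both are simplifying assumptions, not the author's exact procedure.

```python
import numpy as np

# Illustrative panel (all names and values are assumptions for the sketch).
rng = np.random.default_rng(2)
n, T = 6, 30
beta_true = 0.4
x = rng.normal(size=(n, T))
mu = rng.normal(size=(n, 1))          # district-level location parameters
y = mu + beta_true * x + 0.2 * rng.normal(size=(n, T))

# Stage 1: district-level location estimates mu_hat_i
# (here: group means, a simplification of the covariate-free estimate).
mu_hat = y.mean(axis=1, keepdims=True)

# Stage 2: least squares of the deviations (y_it - mu_hat_i) on x_it.
yd = (y - mu_hat).ravel()
xv = x.ravel()
beta_hat = float((xv @ yd) / (xv @ xv))
```

With many periods per district the first-stage estimates are precise, so the second-stage slope lands close to the true coefficient.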
9 In contemporary statistical science, the likelihood theory is a crucial paradigm of inference for data analysis (Royall, 1997: xiii). It provides a unifying approach to statistical modeling for both frequentists and Bayesians with the criterion of maximum likelihood (Azzalini, 1996). The rapid development of political methodology in the last two decades has also witnessed the establishment of the likelihood paradigm in the scientific study of politics (King, 1998). As a model of inference, the fundamental assumption of the likelihood theory is the likelihood principle, which states that "all evidence, which is obtained from an experiment, about an unknown quantity $\theta$, is contained in the likelihood function of $\theta$ for the given function." (Berger and Wolpert, 1984: vii) In other words, given the fact that the likelihood function is defined by the probability density (or mass) function, we must make a distributional assumption about the dependent variable to derive a likelihood function. The plausibility of such a distributional assumption is therefore vital to the validity of the statistical inference.
10 When $b-{{\mu }_{i}}={{\mu }_{i}}-a$, the normal distribution is evenly truncated at both ends. When $\left( a,b \right)\to \left( -\infty ,\infty \right)$, the variable is not truncated at all. Both situations rarely occur when the dependent variable is distributed as truncated normal.
11 This involves a two-stage procedure. In the first stage, $\hat{\mu }_{i}$ is estimated from $y_{it}$ without covariates. In the second stage, we take $\hat{\mu }_{i}$ as the district-level property and subtract it to derive the complete within-groups deviation.